This article explores the possibilities of unsupervised learning methods in Data Science. I am currently enrolled in the first semester of CEE’s Best Data Science Programme, the MSc in Data Science at the University of Warsaw. By now I can tell that topics like dimension reduction (e.g. MDS, PCA, t-SNE) and clustering (kMeans, PAM, DBSCAN) are fundamental skills to acquire if you wish to have a successful career in Data Science or Machine Learning.
Datasets tend to be unlabeled, noisy or just too big. Given the first rule of modelling – “garbage in, garbage out” – you need to master data preparation in order to build the best models and give people the best recommendations. Here, I will work on the well-known Wine Quality dataset from the UCI Machine Learning Repository, also available on Kaggle – partly because I am a huge fan of Sauvignon Blanc and Chardonnay, and partly because I think this data is great for explaining how the aforementioned techniques work. This primer aims to deliver solid practices in this area and to serve as a one-stop shop for unsupervised learning basics.
Happy Reading!
This dataset consists of several numeric variables that describe the wine, and two categorical variables that classify it. I will be using techniques that help you label the data and gather it into similar groups (similar points merge on a plane). That is why I am not going to use the color and quality labels as inputs for the models. Nevertheless, during evaluation it might be helpful to check how well the wines end up grouped. In a perfect solution, we would end up with two main groups, each dominated by wines of the same color, and within those two groups there would be visible structures that mirror the quality of the wines.
To sum up, the wines are currently categorized by:
Type/Color; string representing “red” or “white”,
Quality; integer ranging from 3 (worst) to 9 (best).
However, I will analyse only the numeric variables, as you can see below.
wine.data = read.csv(file = "WineQuality.csv", header = T)
paste0("Number of rows, including NAs: ", dim(wine.data)[1])
## [1] "Number of rows, including NAs: 6497"
wine.data = wine.data %>%
na.omit()
paste0("Number of rows, excluding NAs: ", dim(wine.data)[1])
## [1] "Number of rows, excluding NAs: 6463"
wine.data = wine.data %>%
as.data.table()
var.list = colnames(wine.data[,2:(ncol(wine.data)-1)])
paste0(var.list)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol"
str(wine.data)
## Classes 'data.table' and 'data.frame': 6463 obs. of 13 variables:
## $ type : Factor w/ 2 levels "red","white": 2 2 2 2 2 2 2 2 2 2 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## - attr(*, "na.action")= 'omit' Named int 18 34 55 87 99 140 175 225 250 268 ...
## ..- attr(*, "names")= chr "18" "34" "55" "87" ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(wine.data)
## type fixed.acidity volatile.acidity citric.acid
## red :1593 Min. : 3.800 Min. :0.0800 Min. :0.0000
## white:4870 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500
## Median : 7.000 Median :0.2900 Median :0.3100
## Mean : 7.218 Mean :0.3396 Mean :0.3188
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900
## Max. :15.900 Max. :1.5800 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 1.00
## 1st Qu.: 1.800 1st Qu.:0.03800 1st Qu.: 17.00
## Median : 3.000 Median :0.04700 Median : 29.00
## Mean : 5.444 Mean :0.05606 Mean : 30.52
## 3rd Qu.: 8.100 3rd Qu.:0.06500 3rd Qu.: 41.00
## Max. :65.800 Max. :0.61100 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.: 77.0 1st Qu.:0.9923 1st Qu.:3.110 1st Qu.:0.4300
## Median :118.0 Median :0.9949 Median :3.210 Median :0.5100
## Mean :115.7 Mean :0.9947 Mean :3.218 Mean :0.5311
## 3rd Qu.:156.0 3rd Qu.:0.9970 3rd Qu.:3.320 3rd Qu.:0.6000
## Max. :440.0 Max. :1.0390 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.30 Median :6.000
## Mean :10.49 Mean :5.819
## 3rd Qu.:11.30 3rd Qu.:6.000
## Max. :14.90 Max. :9.000
wine.dataMelted = melt(wine.data)
## Warning in melt.data.table(wine.data): To be consistent with reshape2's
## melt, id.vars and measure.vars are internally guessed when both are 'NULL'.
## All non-numeric/integer/logical type columns are considered id.vars, which
## in this case are columns [type]. Consider providing at least one of 'id' or
## 'measure' vars in future.
## Warning in melt.data.table(wine.data): 'measure.vars' [fixed.acidity,
## volatile.acidity, citric.acid, residual.sugar, ...] are not all of the same
## type. By order of hierarchy, the molten data value column will be of type
## 'double'. All measure variables not of type 'double' will be coerced too.
## Check DETAILS in ?melt.data.table for more on coercion.
colnames(wine.dataMelted) = c("Type", "Variable", "Value")
wine.dataMelted %>%
ggplot(aes(Value, fill = Type)) + geom_histogram(aes(y = ..density..)) +
geom_density(alpha = .5, col = "transparent", fill=viridis(10)[1]) +
facet_wrap(~Variable, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the first glimpse at the data, I can tell that there are a lot of outliers (long right tails of the densities), and that there are big differences in the distributions of each variable between white and red wines. That is why I decided to analyze them side by side, as well as the full dataset for comparison. Dimension reduction and clustering are all about the distances between observations, how much variability the variables carry and how close the data points are to each other. Given that, not only is dividing the data into “red” and “white” subsets mandatory, but scaling is as well.
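To see why scaling matters, consider a toy sketch (values loosely inspired by the dataset’s ranges, not taken from any actual rows): without standardization, the variable with the larger range dominates the Euclidean distance.

```r
# Two hypothetical wines described by total SO2 (range roughly 6-440)
# and alcohol (range roughly 8-15); in theory both variables matter equally.
toy = rbind(c(170, 8.8),
            c( 97, 10.1))
colnames(toy) = c("total.sulfur.dioxide", "alcohol")

dist(toy)        # driven almost entirely by the SO2 difference of 73
dist(scale(toy)) # after scaling, both variables contribute equally
```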
wine.red = wine.data %>%
filter(type == "red") %>%
within(rm("type", "quality")) %>%
scale() %>%
as.data.table()
wine.whi = wine.data %>%
filter(type == "white") %>%
within(rm("type", "quality")) %>%
scale() %>%
as.data.table()
wine.all = wine.data %>%
within(rm("type", "quality")) %>%
scale() %>%
as.data.table()
colnames(wine.red) = colnames(wine.whi) = colnames(wine.all) = var.list
wine.all.noID = wine.all
wine.red.noID = wine.red
wine.whi.noID = wine.whi
wine.red$ID = seq.int(nrow(wine.red))
wine.whi$ID = seq.int(nrow(wine.whi))
wine.all$ID = seq.int(nrow(wine.data))
corrplot::corrplot(cor(wine.all.noID),
method = "color", col = viridis(10),
type = "upper", order = "hclust",
addCoef.col = "black", tl.col = "black", tl.pos = "lt", tl.srt = 45,
sig.level = .05, insig = "blank",
diag = F, title = "Correlation plot of the variables for All Wines")
corrplot::corrplot(cor(wine.red.noID),
method = "color", col = viridis(10),
type = "upper", order = "hclust",
addCoef.col = "black", tl.col = "black", tl.pos = "lt", tl.srt = 45,
sig.level = .05, insig = "blank",
diag = F, title = "Correlation plot of the variables for Red Wines")
corrplot::corrplot(cor(wine.whi.noID),
method = "color", col = viridis(10),
type = "upper", order = "hclust",
addCoef.col = "black", tl.col = "black", tl.pos = "lt", tl.srt = 45,
sig.level = .05, insig = "blank",
diag = F, title = "Correlation plot of the variables for White Wines")
wine.all.melted = melt(wine.all[,1:(ncol(wine.all)-1)])
colnames(wine.all.melted) = c("Variable", "Value")
ggplot(wine.all.melted, aes(x = Variable, y = Value))+
geom_boxplot(outlier.alpha = .5,
color = viridis(11)) +
ggtitle("Cumulative Box-Plot for (scaled) Variables in all Wines")
wine.red.melted = melt(wine.red[,1:(ncol(wine.red)-1)])
colnames(wine.red.melted) = c("Variable", "Value")
ggplot(wine.red.melted, aes(x = Variable, y = Value))+
geom_boxplot(outlier.alpha = .5,
color = viridis(11)) +
ggtitle("Cumulative Box-Plot for (scaled) Variables in red Wines")
wine.whi.melted = melt(wine.whi[,1:(ncol(wine.whi)-1)])
colnames(wine.whi.melted) = c("Variable", "Value")
ggplot(wine.whi.melted, aes(x = Variable, y = Value))+
geom_boxplot(outlier.alpha = .5,
color = viridis(11)) +
ggtitle("Cumulative Box-Plot for (scaled) Variables in white Wines")
At this point, I would like to make you aware of the outliers. As I mentioned, the PCA or kMeans magic is based on distances, and both are linear approaches. That is why you should always take a step back when analysing your outliers. As we treat this article as a handbook, we will skip the outlier removal part and see whether the results on the full dataset are robust. But for the sake of curiosity, if I wanted to remove outliers, I would use the common rule of thumb of flagging points below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. There are multiple ways to detect and delete them, for example with functions from extra packages, but a quick win is a simple nested loop over the variables.
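For reference, a minimal sketch of such an IQR-based filter (hypothetical code, not actually run in this analysis) could look like this:

```r
# Drop rows that fall outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in ANY of the
# listed numeric columns.
remove.outliers = function(dt, vars) {
  keep = rep(TRUE, nrow(dt))
  for (v in vars) {
    x   = dt[[v]]
    q   = quantile(x, c(.25, .75), na.rm = TRUE)
    iqr = q[2] - q[1]
    for (i in seq_along(x)) {
      if (x[i] < q[1] - 1.5 * iqr || x[i] > q[2] + 1.5 * iqr) keep[i] = FALSE
    }
  }
  dt[keep, ]
}
# e.g. wine.clean = remove.outliers(wine.data, var.list)
```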
In this section I will run parallel PCAs on all wines, as well as on the red and the white ones separately. Logically, different types of wine should have different drivers. Moreover, PCA is a great step before moving on to clustering. In my opinion, PCA is somewhat related to feature engineering: we are creating 11 Principal Components (new variables), which can sometimes be explained theoretically and practically.
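As a side note, what prcomp computes on standardized data is essentially an eigen-decomposition of the correlation matrix; a toy sketch on random data, just to show the equivalence:

```r
set.seed(1)
X   = matrix(rnorm(200), ncol = 4)
eig = eigen(cor(X))

# Proportion of variance per component from the eigenvalues...
eig$values / sum(eig$values)

# ...matches "Proportion of Variance" (squared sdev, normalized) reported by
summary(prcomp(X, center = TRUE, scale. = TRUE))
```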
all.pca.1 = prcomp(wine.all.noID, center=TRUE, scale.=TRUE)
whi.pca.1 = prcomp(wine.whi.noID, center=TRUE, scale.=TRUE)
red.pca.1 = prcomp(wine.red.noID, center=TRUE, scale.=TRUE)
summary(all.pca.1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.7410 1.5795 1.2469 0.98540 0.84864 0.77861
## Proportion of Variance 0.2756 0.2268 0.1413 0.08827 0.06547 0.05511
## Cumulative Proportion 0.2756 0.5024 0.6437 0.73197 0.79744 0.85256
## PC7 PC8 PC9 PC10 PC11
## Standard deviation 0.72330 0.7082 0.58049 0.47689 0.18101
## Proportion of Variance 0.04756 0.0456 0.03063 0.02067 0.00298
## Cumulative Proportion 0.90012 0.9457 0.97635 0.99702 1.00000
summary(whi.pca.1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.795 1.2540 1.1053 1.00961 0.98695 0.96823 0.85267
## Proportion of Variance 0.293 0.1429 0.1111 0.09266 0.08855 0.08523 0.06609
## Cumulative Proportion 0.293 0.4359 0.5470 0.63963 0.72818 0.81341 0.87950
## PC8 PC9 PC10 PC11
## Standard deviation 0.77518 0.64380 0.53797 0.14378
## Proportion of Variance 0.05463 0.03768 0.02631 0.00188
## Cumulative Proportion 0.93413 0.97181 0.99812 1.00000
summary(red.pca.1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7603 1.3897 1.2438 1.1018 0.97905 0.81182 0.76313
## Proportion of Variance 0.2817 0.1756 0.1406 0.1104 0.08714 0.05991 0.05294
## Cumulative Proportion 0.2817 0.4572 0.5979 0.7083 0.79540 0.85531 0.90825
## PC8 PC9 PC10 PC11
## Standard deviation 0.65111 0.58670 0.4260 0.24409
## Proportion of Variance 0.03854 0.03129 0.0165 0.00542
## Cumulative Proportion 0.94679 0.97809 0.9946 1.00000
Taking a closer look, we can see that the first Component explains roughly 30% of the variance. Scree plots will help to visualize the marginal variance gained with each subsequent Component.
plot(all.pca.1, type = "l", main = "Scree Plot for All Wines")
plot(whi.pca.1, type = "l", main = "Scree Plot for White Wines")
plot(red.pca.1, type = "l", main = "Scree Plot for Red Wines")
Now it is clear that we can extract the best results from the White Wines, as their first Component explains a lot of variance, while each additional Component adds only marginally to the cumulative variance. The Red Wines and the All Wines dataset are less obvious. We can deduce that the red type is more complex, and more factors are needed to explain the wine.
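A common rule of thumb here is the Kaiser criterion: keep the components whose variance (squared standard deviation) exceeds 1, i.e. those that explain more than a single standardized variable would. On the fitted objects this is a one-liner:

```r
# Number of components retained under the Kaiser criterion (variance > 1)
sum(all.pca.1$sdev^2 > 1)
sum(whi.pca.1$sdev^2 > 1)
sum(red.pca.1$sdev^2 > 1)
```

Given the summaries printed above, this would keep 3 components for All Wines and 4 for each of the White and Red subsets.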
fviz_pca_var(all.pca.1) + ggtitle("Directions for the first two PC in All Wines")
fviz_pca_var(whi.pca.1) + ggtitle("Directions for the first two PC in White Wines")
fviz_pca_var(red.pca.1) + ggtitle("Directions for the first two PC in Red Wines")
While analysing those plots, we need to keep in mind that we rely on only a part of the variance explained; however, in most cases it can be interpreted as a good proxy for the whole dataset. We know that certain Principal Components can be used to capture a finite amount of variability in the dataset, but which variables should we choose from the initial list of 11 to perform the best clustering? Let us then take a closer look at the added value of each variable in each of the first 4 Principal Components.
You might also take a look at the individual contributions, but with a dataset of this size it would not be an efficient way.
fviz_contrib(all.pca.1, "var", axes = 1)
fviz_contrib(all.pca.1, "var", axes = 2)
fviz_contrib(all.pca.1, "var", axes = 3)
fviz_contrib(all.pca.1, "var", axes = 4)
fviz_contrib(whi.pca.1, "var", axes = 1)
fviz_contrib(whi.pca.1, "var", axes = 2)
fviz_contrib(whi.pca.1, "var", axes = 3)
fviz_contrib(whi.pca.1, "var", axes = 4)
fviz_contrib(red.pca.1, "var", axes = 1)
fviz_contrib(red.pca.1, "var", axes = 2)
fviz_contrib(red.pca.1, "var", axes = 3)
fviz_contrib(red.pca.1, "var", axes = 4)
Finally, we might take a look at the variance-maximizing rotation (varimax), with the a priori assumption of the number of factors used (and we decided that 4 maximizes our marginal gain from each additional factor). By analyzing how our variables behave in those Components, we can make the final decision on the variables that will stay for the clustering stage. For the sake of curiosity, you might also want to try the quartimax rotation, which instead minimizes the number of factors needed to explain each variable.
varimax.all = principal(wine.all.noID, nfactors = 4, rotate = "varimax")
varimax.whi = principal(wine.whi.noID, nfactors = 4, rotate = "varimax")
varimax.red = principal(wine.red.noID, nfactors = 4, rotate = "varimax")
print(loadings(varimax.all), digits = 3, cutoff = .4, sort = T)
##
## Loadings:
## RC1 RC2 RC3 RC4
## volatile.acidity -0.626 -0.402
## free.sulfur.dioxide 0.834
## total.sulfur.dioxide 0.835
## residual.sugar 0.435 0.658
## density 0.900
## alcohol -0.820
## fixed.acidity -0.512 0.604
## citric.acid 0.799
## pH -0.718
## chlorides 0.607
## sulphates 0.864
##
## RC1 RC2 RC3 RC4
## SS loadings 2.438 2.264 1.710 1.640
## Proportion Var 0.222 0.206 0.155 0.149
## Cumulative Var 0.222 0.427 0.583 0.732
print(loadings(varimax.whi), digits = 3, cutoff = .4, sort = T)
##
## Loadings:
## RC1 RC2 RC3 RC4
## residual.sugar 0.799
## free.sulfur.dioxide 0.629 -0.412
## total.sulfur.dioxide 0.766
## density 0.891
## alcohol -0.745
## fixed.acidity 0.797
## citric.acid 0.561
## pH -0.731
## volatile.acidity 0.703
## chlorides 0.525 0.540
## sulphates 0.702
##
## RC1 RC2 RC3 RC4
## SS loadings 3.047 1.680 1.159 1.150
## Proportion Var 0.277 0.153 0.105 0.105
## Cumulative Var 0.277 0.430 0.535 0.640
print(loadings(varimax.red), digits = 3, cutoff = .4, sort = T)
##
## Loadings:
## RC1 RC2 RC3 RC4
## fixed.acidity 0.911
## citric.acid 0.741 0.449
## density 0.779 -0.433
## pH -0.731
## free.sulfur.dioxide 0.881
## total.sulfur.dioxide 0.876
## volatile.acidity -0.721
## alcohol 0.777
## chlorides 0.801
## sulphates 0.758
## residual.sugar 0.460
##
## RC1 RC2 RC3 RC4
## SS loadings 2.865 1.802 1.677 1.446
## Proportion Var 0.260 0.164 0.152 0.131
## Cumulative Var 0.260 0.424 0.577 0.708
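The same kind of rotation is also available in base R via stats::varimax; an approximate sketch (psych::principal rescales the loadings by the component standard deviations before rotating, so the result should be close to the output above, though it may not match digit for digit):

```r
# Rotate the first 4 prcomp loadings, rescaled to principal-component loadings
L = all.pca.1$rotation[, 1:4] %*% diag(all.pca.1$sdev[1:4])
stats::varimax(L)$loadings
```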
Before we cluster the initial datasets, we might want to exercise the results of the PCA a little bit more. DBSCAN is a great tool to reduce noise in your data. With kMeans or PAM, you will always classify all data points. I mentioned outliers before – it is not always the case that you would like to group every data point. Sometimes, by setting aside the points far away from the median/mean or center of mass of your dataset, you may obtain a better fit. Of course, you lose some information, but there are use cases where you would like to explain black swans and measure risk by understanding the density of the outliers, and there are also use cases where you would like to apply your knowledge to the “business as usual”, ordinary data points.
By applying PCA to our multidimensional dataset, we traded information for explainability. By giving up around 50% of the variance, we are now able to visualize our data using 2 dimensions, which can mean a lot. Using DBSCAN, we will now test the hypothesis of whether we can classify the wines relying only on those two artificial axes (which jointly explain around 50% of the variance). The kNN distance plots below help to pick the eps parameter: a good value lies near the “knee” of the curve.
varimax.all.2 = principal(wine.all.noID, nfactors = 2, rotate = "varimax")
kNNdistplot(varimax.all.2$scores, k = 5)
abline(h = 0.50, lty = 2)
abline(h = 1.00, lty = 2)
abline(h = 1.50, lty = 2)
abline(h = 2.00, lty = 2)
abline(h = 2.50, lty = 2)
db.all = dbscan(varimax.all.2$scores, eps = 0.25, minPts = 5)
fviz_cluster(db.all, varimax.all.2$scores, geom = "point")
varimax.whi.2 = principal(wine.whi.noID, nfactors = 2, rotate = "varimax")
kNNdistplot(varimax.whi.2$scores, k = 5)
abline(h = 0.50, lty = 2)
abline(h = 1.00, lty = 2)
abline(h = 1.50, lty = 2)
abline(h = 2.00, lty = 2)
abline(h = 2.50, lty = 2)
db.whi = dbscan(varimax.whi.2$scores, eps = 0.25, minPts = 5)
fviz_cluster(db.whi, varimax.whi.2$scores, geom = "point")
varimax.red.2 = principal(wine.red.noID, nfactors = 2, rotate = "varimax")
kNNdistplot(varimax.red.2$scores, k = 5)
abline(h = 0.50, lty = 2)
abline(h = 1.00, lty = 2)
abline(h = 1.50, lty = 2)
abline(h = 2.00, lty = 2)
abline(h = 2.50, lty = 2)
db.red = dbscan(varimax.red.2$scores, eps = 0.30, minPts = 5)
fviz_cluster(db.red, varimax.red.2$scores, geom = "point")
The results of DBSCAN are interesting and helpful in assessing the quality of the 1st and 2nd PC. We can clearly see that multiple points were identified as outliers. It means that those 2 Principal Components are not enough to differentiate between the points in our dataset. There are also several groups of just a few points, which could be interesting if we wanted to research the data deeper.
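dbscan() labels noise points as cluster 0, so quantifying them is straightforward; a quick sketch on the fitted objects:

```r
# Share of points flagged as noise in each run
mean(db.all$cluster == 0)
mean(db.whi$cluster == 0)
mean(db.red$cluster == 0)

# Cluster sizes, with the noise "cluster" 0 listed first
table(db.all$cluster)
```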
We first start by checking the optimal number of clusters based on the explained variance and the silhouette. From the results below, we conclude that for the full dataset the optimal division is into 2 or 3 clusters. As we would like to test the hypothesis stated at the beginning, we will be performing 2-group clustering on this dataset. The White Wines should be divided into 3 groups, and the Red ones into 6 or 7.
Optimal_Clusters_KMeans(wine.all.noID, max_clusters = 6, criterion = "variance_explained")
## [1] 1.0000000 0.7852835 0.6373691 0.5694019 0.5734158 0.5073619
## attr(,"class")
## [1] "k-means clustering"
Optimal_Clusters_KMeans(wine.all.noID, max_clusters = 6, criterion = "silhouette")
## [1] 0.0000000 0.2517288 0.2269697 0.2342679 0.1484660 0.1728824
## attr(,"class")
## [1] "k-means clustering"
Optimal_Clusters_KMeans(wine.whi.noID, max_clusters = 9, criterion = "variance_explained")
## [1] 1.0000000 0.7895946 0.7326235 0.6681650 0.6256485 0.5941730 0.5674527
## [8] 0.5461139 0.5237282
## attr(,"class")
## [1] "k-means clustering"
Optimal_Clusters_KMeans(wine.whi.noID, max_clusters = 9, criterion = "silhouette")
## [1] 0.0000000 0.2090672 0.2411565 0.1864664 0.1656078 0.1560383 0.1436044
## [8] 0.1421155 0.1378092
## attr(,"class")
## [1] "k-means clustering"
Optimal_Clusters_KMeans(wine.red.noID, max_clusters = 9, criterion = "variance_explained")
## [1] 1.0000000 0.8147897 0.7178377 0.6519614 0.6080868 0.5358833 0.4910852
## [8] 0.4716332 0.4836348
## attr(,"class")
## [1] "k-means clustering"
Optimal_Clusters_KMeans(wine.red.noID, max_clusters = 9, criterion = "silhouette")
## [1] 0.0000000 0.1941571 0.1722255 0.1587840 0.1505925 0.1976011 0.1999716
## [8] 0.1703923 0.1355746
## attr(,"class")
## [1] "k-means clustering"
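As a cross-check, factoextra offers the same diagnostics in plot form; a sketch (note that fviz_nbclust refits kmeans for every k, so it can be slow on the full dataset):

```r
# Average silhouette width as a function of k
fviz_nbclust(wine.all.noID, kmeans, method = "silhouette", k.max = 6)

# Within-cluster sum of squares ("elbow" method)
fviz_nbclust(wine.all.noID, kmeans, method = "wss", k.max = 6)
```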
Below, we cluster the data, visualize the clusters and check the silhouette. I am using both the Euclidean and the Manhattan metric, to compare which one deals better with high dimensionality.
kmeans.euc.all = eclust(wine.all.noID, "kmeans", hc_metric = "euclidean", k = 2)
fviz_cluster(kmeans.euc.all)
fviz_silhouette(kmeans.euc.all)
## cluster size ave.sil.width
## 1 1 2192 0.30
## 2 2 4271 0.12
kmeans.man.all = eclust(wine.all.noID, "kmeans", hc_metric = "manhattan", k = 2)
fviz_cluster(kmeans.man.all)
fviz_silhouette(kmeans.man.all)
## cluster size ave.sil.width
## 1 1 2192 0.30
## 2 2 4271 0.12
kmeans.euc.whi = eclust(wine.whi.noID, "kmeans", hc_metric = "euclidean", k = 3)
fviz_cluster(kmeans.euc.whi)
fviz_silhouette(kmeans.euc.whi)
## cluster size ave.sil.width
## 1 1 1617 0.12
## 2 2 1461 0.13
## 3 3 1792 0.17
kmeans.man.whi = eclust(wine.whi.noID, "kmeans", hc_metric = "manhattan", k = 3)
fviz_cluster(kmeans.man.whi)
fviz_silhouette(kmeans.man.whi)
## cluster size ave.sil.width
## 1 1 1617 0.12
## 2 2 1461 0.13
## 3 3 1792 0.17
kmeans.euc.red = eclust(wine.red.noID, "kmeans", hc_metric = "euclidean", k = 7)
fviz_cluster(kmeans.euc.red)
fviz_silhouette(kmeans.euc.red)
## cluster size ave.sil.width
## 1 1 46 0.07
## 2 2 246 0.22
## 3 3 187 0.14
## 4 4 29 0.35
## 5 5 496 0.23
## 6 6 267 0.18
## 7 7 322 0.15
kmeans.man.red = eclust(wine.red.noID, "kmeans", hc_metric = "manhattan", k = 7)
fviz_cluster(kmeans.man.red)
fviz_silhouette(kmeans.man.red)
## cluster size ave.sil.width
## 1 1 46 0.07
## 2 2 246 0.22
## 3 3 187 0.14
## 4 4 29 0.35
## 5 5 496 0.23
## 6 6 267 0.18
## 7 7 322 0.15
The results are very interesting. We have a decent silhouette within the clusters. It is not the highest, but given the multidimensionality and the fact that even wines of the same colour and quality can differ, I am happy with the result. What is also interesting is that (especially in the Red Wine data) we have certain groups that are very tightly packed, like cluster number 4, with a relatively high silhouette, but we also have misclassified data points, as the negative silhouette values indicate.
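A negative silhouette means a point lies closer, on average, to a neighbouring cluster than to its own; eclust stores the per-point widths, so such points can be listed directly (a sketch on the red-wine fit):

```r
# Per-point silhouette widths: columns cluster, neighbor, sil_width
sil = kmeans.euc.red$silinfo$widths

head(sil[sil$sil_width < 0, ])   # candidate misclassified points
nrow(sil[sil$sil_width < 0, ])   # how many there are in total
```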
comPart(
cclust(wine.all.noID, 2, dist = "euclidean"),
cclust(wine.all.noID, 2, dist = "manhattan"))
## ARI RI J FM
## 0.9493695 0.9761502 0.9622770 0.9807788
comPart(
cclust(wine.whi.noID, 3, dist = "euclidean"),
cclust(wine.whi.noID, 3, dist = "manhattan"))
## ARI RI J FM
## 0.4371024 0.7486790 0.4560968 0.6264670
comPart(
cclust(wine.red.noID, 7, dist = "euclidean"),
cclust(wine.red.noID, 7, dist = "manhattan"))
## ARI RI J FM
## 0.4459258 0.8349847 0.3754749 0.5477388
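For intuition, the (unadjusted) Rand Index is simply the share of point pairs on which two partitions agree – either grouped together in both or separated in both; a toy sketch:

```r
# Naive O(n^2) Rand Index: count pairs on which partitions a and b agree
rand.index = function(a, b) {
  agree = 0
  n = length(a)
  for (i in 1:(n - 1)) for (j in (i + 1):n)
    agree = agree + ((a[i] == a[j]) == (b[i] == b[j]))
  agree / choose(n, 2)
}

rand.index(c(1, 1, 2, 2), c(1, 1, 2, 3))  # 5 of 6 pairs agree: ~0.833
```

The Adjusted Rand Index reported by comPart additionally corrects this score for chance agreement, which is why it can be much lower than the raw RI.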
Except for the full dataset, where the two metrics agree almost perfectly, clustering with different distance metrics is rather unstable, as the Rand Index and similar measures indicate. Below is a set of plots visualizing the distance of each point from its cluster centroid. The left plot of each pair is for the Euclidean metric, the right one for the Manhattan metric. The first pair is for all the data, the second for the White Wines and the last pair is based on the Red Wines segmentation.
stripes(cclust(wine.all.noID, 2, dist = "euclidean"))
stripes(cclust(wine.all.noID, 2, dist = "manhattan"))
stripes(cclust(wine.whi.noID, 3, dist = "euclidean"))
stripes(cclust(wine.whi.noID, 3, dist = "manhattan"))
stripes(cclust(wine.red.noID, 7, dist = "euclidean"))
stripes(cclust(wine.red.noID, 7, dist = "manhattan"))
In conclusion, one can easily see that it is not simple to run such an analysis. At every stage you might wonder whether you made the right decision, and there are dozens of ways to evaluate or measure the effects of that decision. This primer barely touches on those methods. PCA, DBSCAN and kMeans clustering have a very rich theoretical background, in which I deliberately did not immerse myself, but which I definitely recommend you get familiar with.
As for the results, we obtained a number of different Principal Components, and we also tried to rotate the dimensions in such a way as to capture as much variance as possible with just 4 of them. Nevertheless, that did not deliver a substantial enough amount of variance to simply drop the initial dataset and work on an artificial one. Moreover, most of the Components were built from almost every variable, which made the interpretation and explainability of the new features all but impossible. DBSCAN showed us that a lot of the variance is still needed to fully understand the behaviour of the Wine Data: while we limited ourselves to a fixed number of Components for the sake of explainability, we lost a major part of our ability to differentiate the points.
Outliers were a major challenge during this analysis, but for the sake of learning, I decided to leave them in order to explain and show, why they might cause problems in certain analyses.